Regression using M-estimators and polynomial kernel support vector machines and principal component regression

ABSTRACT

Embodiments of the invention relate to sketching for M-estimators for performing regression. One embodiment includes providing one or more sets of input data. A matrix A and a vector b are generated using the input data. A processor device is used for processing the matrix A and the vector b based on a randomized sketching matrix S. A vector x that minimizes a normalized measure function is determined based on the matrix A and the vector b. A relationship between the input data is determined based on the vector x.

This invention was made with Government support under FA8750-12-C-0323P00003 awarded by Defense Advanced Research Projects Agency (DARPA). The Government has certain rights in this invention.

BACKGROUND

Embodiments of the invention relate to regression processing and, in particular, sketching for M-estimators for performing regression and providing an oblivious subspace embedding (OSE) for embedding a space induced by a nonlinear kernel.

Linear regression is widely used in biological, behavioral and social sciences to describe possible relationships between variables. It ranks as one of the most important tools used in these disciplines. Regression processing sometimes involves the tool of sketching, which in its general form is a descendant of random projection methods. Sketching has emerged as a powerful dimensionality reduction technique for accelerating statistical learning techniques such as l_(p)-regression, low rank approximation, principal component analysis (PCA), and support vector machines. For natural settings of parameters, the sketching technique has led to asymptotically optimal algorithms for a number of applications, often providing a speedup of an order of magnitude over slower exact algorithms given by, e.g., the singular value decomposition (SVD). An oblivious subspace embedding (OSE) is essentially a data-independent random transform which is an approximate isometry over a subspace. It is crucial for any reasonable use of an OSE that applying it to a vector or a collection of vectors (a matrix) can be done in time that is faster than that of the naïve algorithm, or at least faster than the intended downstream use.

Conventional OSEs are for subspaces that have a representation as the column space of an explicitly provided matrix, or close variants of it, and they admit a fast multiplication given an explicit representation of a vector or matrix (either dense or sparse). This is quite unsatisfactory in many statistical learning settings. In many cases the input may be described by a moderately sized n-by-d sample-by-feature matrix A, but the actual learning is done in a much higher (possibly infinite) dimensional space, by mapping each row of A to the higher dimensional feature space.

BRIEF SUMMARY

Embodiments of the invention relate to sketching for M-estimators for performing regression. One embodiment includes a method that includes providing one or more sets of input data. A matrix A and a vector b are generated using the input data. A processor device is used for processing the matrix A and the vector b based on a randomized sketching matrix S. A vector x that minimizes a normalized measure function is determined based on the matrix A and the vector b. A relationship between the input data is determined based on the vector x.

These and other features, aspects and advantages of the present invention will become understood with reference to the following description, appended claims and accompanying figures.

BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS

FIG. 1 is a block diagram illustrating an example of a network environment for sketching for M-estimators for performing regression and OSE embedding, according to an embodiment of the present invention;

FIG. 2 is a block diagram illustrating an example of a server including a system utilizing sketching for M-estimators for performing regression and OSE embedding, according to an embodiment of the present invention, as shown in FIG. 1;

FIG. 3 illustrates a block diagram of an example computing device for sketching for M-estimators for performing regression, in accordance with an embodiment of the invention;

FIG. 4 is a block diagram showing a process for sketching for M-estimators for performing regression, in accordance with an embodiment of the invention; and

FIG. 5 is a block diagram showing a process for sketching using an OSE for embedding a space induced by a nonlinear kernel, in accordance with an embodiment of the invention.

DETAILED DESCRIPTION

Aspects of the present invention are described below with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products, according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.

FIG. 1 illustrates an example of the basic components of an information technology system 10 utilizing system 100, according to an embodiment of the present invention. The information technology system 10 includes a server 11 and remote devices 15 and 17-20 that may utilize the system 100 of the present invention. In one embodiment, the server 11 implements the system 100 of the present invention.

Each of the remote devices 15 and 17-20 has applications and can have a local database 16. Server 11 contains applications, and is connected to a database 12 that can be accessed by remote devices 15 and 17-20 via connections 14(A-F), respectively, over a network 13. The server 11 executes software for a computer network and controls access to itself and database 12. The remote devices 15 and 17-20 may access the database 12 over the network 13, such as but not limited to: the Internet, a local area network (LAN), a wide area network (WAN), via a telephone line using a modem (POTS), Bluetooth, WiFi, WiMAX, cellular, optical, satellite, RF, Ethernet, magnetic induction, coax, RS-485, the like or other like networks. The server 11 may also be connected to the local area network (LAN) within an organization.

The remote devices 15 and 17-20 may each be located at remote sites. Remote devices 15 and 17-20 include, but are not limited to, PCs, workstations, laptops, handheld computers, pocket PCs, PDAs, pagers, WAP devices, non-WAP devices, cell phones, palm devices, printing devices, and the like. Included with each remote device 15 and 17-20 is an ability to request relevant material from a large collection of documents via search queries to the server 11. Thus, when a user at one of the remote devices 15 and 17-20 desires to access the system 100 and the database 12 at the server 11, the remote device 15 and 17-20 communicates over the network 13 to access the system 100, the server 11 and database 12.

Third party computer systems 21 and databases 22 can be accessed by the server 11 in order to provide access to additional collections of documents and/or search indexes. Data that is obtained from third party computer systems 21 and database 22 can be stored on server 11 and database 12 in order to provide later access to the user on remote devices 15 and 17-20. It is also contemplated that for certain types of data, the remote devices 15 and 17-20 can access the third party computer systems 21 and database 22 directly using the network 13.

In one or more embodiments, the system 100 utilizes a process for sketching for M-estimators for performing regression and/or OSE embedding. One or more embodiments provide processing for the M-estimators of min_(x)∥Ax−b∥_(G), where A∈ℝ^(n×d) and b∈ℝ^(n), and ∥y∥_(G) for y∈ℝ^(n) is specified by a cost or measure function G: ℝ→ℝ⁺, with ∥y∥_(G)≡Σ_(i) G(y_(i)), where A is an n×d matrix, b, x (e.g., explanatory, unknown) and y (e.g., dependent, measurements, samples, etc.) are vectors, and i, n and d are positive integers. M-estimators generalize l_(p) (e.g., least squares, robust, etc.) regression, for which G(x)=|x|^(p), where p∈[1, ∞]. In one embodiment, M-sketch is introduced, which includes a randomized construction of a sketching matrix S so that SAx may be used to estimate ∥Ax∥_(G), for all x∈ℝ^(d). This construction yields a single sketching process for all M-estimators requiring O(nnz(A)+poly(d log n)) time (where nnz(A) is the number of non-zero entries of the matrix A) and making one pass over the matrix in order to find a solution whose residual error is within a constant factor of optimal (e.g., as selected), with constant probability. In one example embodiment, an M-estimator using a Huber measure can be computed up to relative error ε in O(nnz(A)log n+poly(d(log n)/ε)) time. One or more embodiments provide improvements for subspace embeddings for l₁ (robust regression), and for dimensionality reduction from l₁ into l₁.

In one embodiment, system 100 provides fast oblivious subspace embeddings for spaces induced by a non-linear kernel without explicitly mapping the data to the high-dimensional space. In one example, OSEs for mappings induced by the polynomial kernel are implemented. The polynomial kernel is an important kernel often used in, e.g., natural language processing. In other examples, popular kernels such as the Gaussian kernel may sometimes be approximated by the polynomial kernel using a Taylor series. In one embodiment, the implemented OSEs may be used to obtain faster algorithms for the polynomial kernel. Namely, faster processes are obtained for kernel support vector machines (SVMs), low rank approximation and approximate kernel principal component analysis (PCA), and approximate principal component regression (PCR).

One or more embodiments may employ the system 100 for regression applications involving biological, behavioral and social sciences, trend estimations, epidemiology, finance, economics, environmental science, astronomy, bioinformatics, medicine, engineering, control theory, image analysis, computer science, Internet analysis, data analysis, etc.

Illustrated in FIG. 2 is a block diagram demonstrating an example of server 11, as shown in FIG. 1, utilizing the system 100 according to an embodiment of the present invention. The server 11 includes, but is not limited to, PCs, workstations, laptops, PDAs, palm devices, and the like. The processing components of the third party computer systems are similar to those described for the server 11 (FIG. 2).

Generally, in terms of hardware architecture, as shown in FIG. 2, the server 11 includes a processor 41, a computer readable medium such as memory 42, and one or more input and/or output (I/O) devices (or peripherals) that are communicatively coupled via a local interface 43. The local interface 43 can be, for example but not limited to, one or more buses or other wired or wireless connections, as is known in the art. The local interface 43 may have additional elements, which are omitted for simplicity, such as controllers, buffers (caches), drivers, repeaters, and receivers to enable communications. Further, the local interface 43 may include address, control, and/or data connections to enable appropriate communications among the aforementioned components.

The processor 41 is a hardware device for executing software that can be stored in memory 42. The processor 41 can be virtually any custom made or commercially available processor, a central processing unit (CPU), data signal processor (DSP) or an auxiliary processor among several processors associated with the server 11, and a semiconductor based microprocessor (in the form of a microchip) or a microprocessor.

The memory 42 can include any one or combination of volatile memory elements (e.g., random access memory (RAM), such as dynamic random access memory (DRAM), static random access memory (SRAM), etc.) and nonvolatile memory elements (e.g., read only memory (ROM), erasable programmable read only memory (EPROM), electronically erasable programmable read only memory (EEPROM), programmable read only memory (PROM), tape, compact disc read only memory (CD-ROM), disk, diskette, cartridge, cassette or the like, etc.). Moreover, the memory 42 may incorporate electronic, magnetic, optical, and/or other types of storage media. Note that the memory 42 can have a distributed architecture, where various components are situated remote from one another, but can be accessed by the processor 41.

The software in memory 42 may include one or more separate programs, each of which comprises an ordered listing of executable instructions for implementing logical functions. In the example illustrated in FIG. 2, the software in the memory 42 includes a suitable operating system (O/S) 51 and the search system 100 of the present invention. The system 100 comprises functional components and process blocks described further below.

The operating system 51 essentially controls the execution of other computer programs, such as the system 100, and provides scheduling, input/output control, file and data management, memory management, and communication control and related services. However, the system 100 of the present invention is applicable on all other commercially available operating systems.

The system 100 may comprise a source program, executable program (object code), script, or any other entity comprising a set of computer program instructions to be performed. When the system 100 is a source program, then the program is usually translated via a compiler, assembler, interpreter, or the like, which may or may not be included within the memory 42, so as to operate properly in connection with the O/S 51. Furthermore, the system 100 can be written as (a) an object oriented programming language, which has classes of data and methods, or (b) a procedure programming language, which has routines, subroutines, and/or functions. The computer program instructions may execute entirely on server 11, partly on the server 11, as a stand-alone software package, partly on server 11 and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).

These computer program instructions may also be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.

The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.

The I/O devices may include input devices, for example but not limited to, a mouse 44, keyboard 45, scanner (not shown), microphone (not shown), etc. Furthermore, the I/O devices may also include output devices, for example but not limited to, a printer (not shown), display 46, etc. Finally, the I/O devices may further include devices that communicate both inputs and outputs, for instance but not limited to, a NIC or modulator/demodulator 47 (for accessing remote devices, other files, devices, systems, or a network), a radio frequency (RF) or other transceiver (not shown), a telephonic interface (not shown), a bridge (not shown), a router (not shown), etc.

If the server 11 is a PC, workstation, intelligent device or the like, the software in the memory 42 may further include a basic input output system (BIOS) (omitted for simplicity). The BIOS is a set of essential software routines that initialize and test hardware at startup, start the O/S 51, and support the transfer of data among the hardware devices. The BIOS is stored in some type of read-only memory, such as ROM, PROM, EPROM, EEPROM or the like, so that the BIOS can be executed when the server 11 is activated.

When the server 11 is in operation, the processor 41 is configured to execute software stored within the memory 42, to communicate data to and from the memory 42, and generally to control operations of the server 11 pursuant to the software. The system 100 and the O/S 51 are read, in whole or in part, by the processor 41, perhaps buffered within the processor 41, and then executed.

FIG. 3 shows an implementation of system 100 as a computing device 300 for sketching for M-estimators for performing regression and providing an oblivious subspace embedding (OSE) for embedding a space induced by a nonlinear kernel. In one embodiment, the computing device 300 comprises a storage module 310, a matrix processing module 320, a vector processing module 330, a sketching module 340 and a regression process 350. In one embodiment, computing device 300 may extend and improve on O(nnz(A)) sketching results using the sketching module 340, matrix processing module 320, vector processing module 330 and the regression process 350 by giving a single sketching matrix construction, which is referred to herein as an M-sketch, that encompasses not only the l₂ (linear regression) and l₁ (robust regression) cases, but also the general category of statistical M-estimators, which seek to combine the insensitivity to outliers of l₁ regression with the low variance of l₂ regression. In one embodiment, the estimators are specified by a cost or measure function G: ℝ→ℝ⁺, where G(x)=G(−x), G(0)=0, and G is non-decreasing in |x|. The result is a new "norm" ∥y∥_(G)≡Σ_(i∈[n]) G(y_(i)). It should be noted that in general, the functions ∥∥_(G) are not true norms, but are sometimes referred to as such herein. In one embodiment, the M-sketch is constructed independently of G, and the same sketch may be used for all G. Thus, the input from the storage module 310 may be sketched by the sketching module 340, and a particular choice of a penalty function is determined later by the regression process 350.

In one example, the Huber norm is specified by a parameter τ>0, and its measure function H is given by

${H(a)} \equiv \left\{ {\begin{matrix}{a^{2}\text{/}2\tau} & {{if}\;{|a|} \leq \tau} \\{{|a|} - {\tau\text{/}2}} & {otherwise,}\end{matrix}} \right.$

combining an l₂-like measure for small a with an l₁-like measure for large a. In one embodiment, the sketching module 340 provides an M-sketching scheme that comprises a sketching matrix S∈ℝ^(t×n), and the vector processing module provides a scaling vector w∈ℝ^(t). In one example, for z∈ℝ^(t), let ∥z∥_(G,w) denote Σ_(i∈[t]) w_(i)G(z_(i)). In one embodiment, the M-sketch SA of the matrix A has the property that ∥SAx∥_(G,w)≈∥Ax∥_(G), under conditions that are enough to show that the G-regression problem min_(x∈ℝ^(d)) ∥Ax−b∥_(G) may be reduced in O(nnz(A)) time to the solution of a similar kind of problem min_(x∈ℝ^(d)) ∥S(Ax−b)∥_(G,w) (with a side constraint), yielding an O(log n) approximation with constant probability over the random choice of S. In one example embodiment, the scaling vector w is chosen deterministically by the vector processing module 330.
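As a concrete illustration of these definitions only, the following minimal Python sketch (not the embodiment itself; the function names and the choice of τ are illustrative) evaluates the Huber measure ∥y∥_(H) and a weighted measure ∥z∥_(G,w) of the form used for a sketched vector z=S(Ax−b):

```python
import numpy as np

def huber_measure(y, tau):
    """||y||_H = sum_i H(y_i), with H(a) = a^2/(2*tau) for |a| <= tau and |a| - tau/2 otherwise."""
    a = np.abs(y)
    return np.sum(np.where(a <= tau, a**2 / (2 * tau), a - tau / 2))

def weighted_measure(z, w, G):
    """||z||_{G,w} = sum_i w_i * G(z_i) for a user-supplied (vectorized) measure function G."""
    return np.sum(w * G(z))

y = np.array([0.1, -0.3, 5.0, 0.02])
print(huber_measure(y, tau=1.0))                      # mostly l2-like terms plus one l1-like term
print(weighted_measure(y, np.ones_like(y), np.abs))   # with G(x)=|x| and unit weights this is ||y||_1
```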

In one example embodiment, the regression process 350 changes the measure on SAx from ∥SAx∥_(G,w) to a "clipped" version ∥∥_(Gc,w), in which only some of the larger coordinates of the sketch vector SAx are used when evaluating the norm. This variation allows a sketching process implemented by the sketching module 340 that provides a constant-factor approximation to min_(x)∥Ax−b∥_(G) with constant probability.

In one example, additional conditions on the function G beyond symmetry and monotonicity may be applied, such as that it grows no faster than quadratically in x, and no slower than linearly. Subquadratic growth is appropriate for robust regression, to reduce the effect of large values in the residual Ax−b, relative to their effect in least-squares. Almost all proposed M-estimators satisfy these conditions. The latter linear lower bound on the growth of G holds for all convex G, and many M-estimators have convex G. Moreover, the convexity of G implies the convexity of ∥∥_(G), which is needed for computing a solution to the minimization problem in polynomial time. Convexity also implies significant properties for the statistical interpretation of the results, such as consistency and asymptotic normality. In one embodiment, G is not required to be convex for sketching results from the sketching module 340. Some M-estimators are not convex. Therefore, in one embodiment the regression process 350 reduces a large non-convex problem, min_(x)∥Ax−b∥_(G), to a smaller non-convex problem min_(x)∥S(Ax−b)∥_(G,w) of a similar kind. The linear growth lower bound implies that sketching by the sketching module 340 cannot be applied to some proposed M-estimators; for example, the "Tukey" estimator, whose G function is constant for large argument values, is not covered by these results. However, in one or more embodiments, the regression process 350 may get close, in the sense that, at the cost of more computation, G functions that grow arbitrarily slowly are handled by the regression process 350.

In one or more embodiments, the regression process 350 obtains optimal O(nnz(A)+poly(d log n)) time approximation processes for the M-estimators, and the sketch non-trivially reduces the dimension of any of these estimators other than the l_(p)-norms (which are a special case of M-estimators). With conventional regression algorithms, for the L₁-L₂ estimator in which G(x)=2(√(1+x²/2)−1), the Fair estimator in which

${{G(x)} = {c^{2}\left\lbrack {\frac{x}{c} - {\log\left( {1 + \frac{x}{c}} \right)}} \right\rbrack}},$

or the Huber estimator, no dimensionality reduction for the regression problem is known.

In one embodiment, for general M-estimators, the regression process 350 provides constant-factor approximations for O(nnz(A)+poly(d log n))-time one-pass implementations. In one embodiment, for the case of the Huber norm, a sampling scheme based on a combination of Huber's l₁ and l₂ properties may be used to obtain an implementation yielding an ε-approximation with respect to the Huber norm of the residual. The Huber norm is of particular interest, because it is popular and recommended for many situations as it is robust in a certain sense, and because it has the useful computational and statistical properties implied by the convexity and smoothness of its defining function. The smoothness makes it differentiable at all points, which can lead to computational savings over l₁, while enjoying the same robustness properties with respect to outliers. Additionally, l₁, for instance, treats small residuals "as seriously" as large residuals, whereas it is often more appropriate to have robust treatment of large residuals and Gaussian treatment of small residuals.

In one embodiment, the M-sketch yields an improved l₁ subspace embedding compared with conventional regression processes. For G(x)=|x|, the M-sketch value may be denoted as Σ_(i∈[t]) w_(i)|z_(i)|=∥D(w)z∥₁, where D(w) is the diagonal matrix W with W_(ii)=w_(i); for G(x)=|x|, ∥Sy∥_(G,w)=∥D(w)Sy∥_(G)=∥D(w)Sy∥₁. In one embodiment, with constant probability, the M-sketch SA has Ω(1)∥Ax∥₁≦∥D(w)SAx∥₁≦O(d)∥Ax∥₁, for all x∈ℝ^(d), and hence has distortion O(d).

One or more embodiments using the regression process 350 have results involving a sketching dimension that is polynomial in d. In one example embodiment, with a sketching dimension that is poly(d/ε)^(1/ε), M-sketches have the dilation bound ∥D(w)Sy∥₁≦(1+ε)∥y∥₁ for a single vector y, with constant probability. Together with the contraction bound ∥D(w)Sy∥₁≧(1−ε)∥y∥₁, this is an improvement over a construction due to Indyk, for which the probability of success is ε. In one embodiment, the results for l₁ are built on bounds on the contraction and dilation of D(w)S, based on theorems that state that ∥D(w)SAx∥₁ will not be too much smaller than ∥Ax∥₁, with high probability, and cannot be too much larger, with constant probability.

In one embodiment, the sketching module 340 provides the M-sketch that is novel for l₁, because unlike many conventional sketching schemes for l₁, the sketching module 340 does not use continuous random variables such as Cauchy or exponential. It should be noted that M-sketches do involve random variables that, like Cauchy, have an undefined expectation. It should also be noted that although there are sketches in the streaming literature for estimating l₁ values with sign (±1) random variables, the conventional techniques use operations such as the median to achieve high probability results (which is needed in one direction for regression), and those operations do not make them amenable to optimization.

In one embodiment, the sketching module 340 interacts with the matrix processing module 320 and the vector processing module 330 for implementing a variant construction, comprising a sequence of sketching matrices S₀, S₁, . . . , S_(h_max), for a parameter h_max, each comprising a block of rows of the sketching matrix:

$S \equiv {\begin{bmatrix}S_{0} \\S_{1} \\S_{2} \\\vdots \\S_{h_{\max}}\end{bmatrix}.}$

When applied to a vector y∈ℝ^(n), each S_(h) ignores all but a subset L_(h) of n/b^(h) entries of y, where b>1 is a parameter, and where those entries are chosen uniformly at random (i.e., S_(h) may be factored as S′_(h)S″_(h), where S″_(h)∈ℝ^(n/b^(h)×n) samples row i of A by having column i with a single 1 entry and the rest zero, so that S″_(h) has only n/b^(h) nonzero entries).

In one embodiment, each S_(h) implements a particular sketching scheme from the sketching module 340, referred to as COUNT-SKETCH, on its random subset. COUNT-SKETCH splits the coordinates of y into groups ("buckets") at random, and adds together each group after multiplying each coordinate by a random ±1; each such sum constitutes a coordinate of S_(h)y. COUNT-SKETCH has been shown to be a good subspace embedding for l₂, implying here that the matrix S₀, which applies to all the coordinates of y=Ax, has the property that ∥S₀Ax∥₂ is a good estimator for ∥Ax∥₂ for all x∈ℝ^(d); in particular, the squared magnitude of each coordinate of S₀y is, in expectation, the squared l₂ norm of the coordinates in its contributing group.

In one embodiment, the sketching module 340 provides a construction, based on l₂ embeddings, that is suitable for l₁, with ∥D(w)SAx∥₁ provided as an estimate of ∥Ax∥₁, where the M-sketch is effective for that norm. In one example, an intuition is derived from considering the matrix S_(h_max) for the smallest random subset L_(h_max) of y=Ax to be sketched; one can think of S_(h_max)y as one coordinate of y=Ax, chosen uniformly at random and sign-flipped. The expectation of ∥S_(h_max)y∥₁ is Σ_(i∈[n])|y_(i)|/n=∥y∥₁/n; with appropriate scaling from D(w), that smallest random subset yields an estimate of ∥y∥₁=∥Ax∥₁ (this scaling is where the values w are needed). The variance of this estimate is too high to be useful, especially when the norm of y is concentrated in one coordinate, say y₁=1, and all other coordinates zero. For such a y, however, ∥y∥₂=∥y∥₁, so the base level estimator ∥S₀y∥₂ provides a good estimate. On the other hand, when y is the vector with all coordinates 1/n, the variance of ∥S_(log_b n)y∥₁ is zero, while ∥S₀y∥₂=∥y∥₂ is quite inaccurate as an estimator of ∥y∥₁. So in these extreme cases, the extreme ends of the M-sketch are effective. The intermediate matrices S_(h) of the M-sketch serve to help with less extreme cases of y-vectors.

In one embodiment, another intuition about the usefulness of the M-sketch for robust estimation is derived from considering data A and b coming from a dataset (from the storage module 310) with inliers and outliers: that is, there is x_(o) so that for most i∈[n], A_(i:)x_(o)−b_(i) is very small, where A_(i:) denotes the i-th row of A, but for a few i∈[n], A_(i:)x_(o)−b_(i) is very large, so that x_(o) is not a minimizer of ∥Ax−b∥₁=Σ_(i)|A_(i:)x−b_(i)|. In one embodiment, the sketching module 340, in constructing the M-sketch, may determine that it is likely that for several of the S_(h), the random subset L_(h) includes none of the coordinates in the small set corresponding to outlier residuals, so that ∥D(w)S(Ax−b)∥₁=Σ_(h)∥D(w_(h))S_(h)(Ax−b)∥₁ has many contributing terms for which the outliers have no effect. In one embodiment, the vector processing module 330 builds the scaling vector w so that the weighted contributions from each S_(h) are roughly equal; this reduces the effect of the outliers, and shifts the solution toward x_(o).
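The following Python sketch illustrates the multi-level structure just described, under stated assumptions: CountSketch is applied to progressively smaller uniform random subsets, and the per-level weights are taken to be w_(h)=b^(h) purely for illustration (the deterministic choice of w made by the vector processing module 330 is not reproduced here). The helper names, the bucket count, and the base b are all hypothetical.

```python
import numpy as np

def m_sketch(y, b=8, buckets=60, rng=None):
    """Level h keeps a uniform random subset of ~n/b^h coordinates of y and applies
    CountSketch (random buckets, random signs) to it; returns the per-level sketches
    and illustrative scaling weights w_h = b^h."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(y)
    levels, weights, h = [], [], 0
    size = n
    while size >= 1:
        subset = rng.choice(n, size=size, replace=False)   # the random subset L_h
        bucket = rng.integers(0, buckets, size=size)        # CountSketch bucket assignment
        sign = rng.choice([-1.0, 1.0], size=size)           # CountSketch signs
        z = np.zeros(buckets)
        np.add.at(z, bucket, sign * y[subset])               # z_j = sum_{i: h(i)=j} s(i) y_i
        levels.append(z)
        weights.append(float(b) ** h)
        h += 1
        size = n // b**h
    return levels, weights

def weighted_l1(levels, weights):
    """||Sy||_{G,w} for G(x) = |x|: sum over levels of w_h * ||S_h y||_1."""
    return sum(w * np.abs(z).sum() for z, w in zip(levels, weights))

rng = np.random.default_rng(0)
y = rng.standard_normal(10000)
levels, weights = m_sketch(y, rng=rng)
# Sketched value vs. exact ||y||_1; agreement is only up to the distortion factors discussed above.
print(weighted_l1(levels, weights), np.abs(y).sum())
```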

In one embodiment, the regression process 350, for the Huber estimator, involves importance sampling of the rows (A_(i:), b_(i)), where a sampling matrix S′ is obtained such that ∥S′(Ax−b)∥_(H,w) is a useful approximation to ∥Ax−b∥_(H). The sampling probabilities are based on a combination of the l₁ leverage score vector u∈ℝ^(n) and the l₂ leverage score vector u′∈ℝ^(n). In one embodiment, the regression process 350 uses the l₁ vector u from the vector processing module 330 to obtain good sampling probabilities for l₁ regression, and similarly for u′ and l₂. Since the Huber measure has a mixed l₁/l₂ character, a combination of l₁ and l₂ scores is used to obtain good sampling probabilities for Huber. In one embodiment, the Huber norm of a vector is bounded in terms of n, τ, and its l₁ and l₂ norms, and this leads to a recursive sampling process.
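A minimal numerical sketch of the sampling idea follows. It assumes the l₁ leverage-type scores u are given (their computation, e.g., from a well-conditioned basis, is not shown), computes l₂ leverage scores from a QR factorization, and mixes the two into row-sampling probabilities with a simple sum, which is only an illustrative stand-in for the combination used in this embodiment:

```python
import numpy as np

def l2_leverage_scores(A):
    """l2 leverage scores: squared row norms of an orthonormal basis Q for the column space of A."""
    Q, _ = np.linalg.qr(A)
    return np.sum(Q**2, axis=1)

def mixed_sampling_probs(u1, u2, sample_size):
    """Turn l1-type scores u1 and l2 scores u2 into row-sampling probabilities with
    expected sample size roughly sample_size (illustrative combination only)."""
    scores = u1 / u1.sum() + u2 / u2.sum()
    return np.minimum(1.0, sample_size * scores / scores.sum())

rng = np.random.default_rng(1)
A = rng.standard_normal((1000, 5))
u2 = l2_leverage_scores(A)
u1 = np.abs(A).sum(axis=1)                       # hypothetical stand-in for l1 leverage-type scores
p = mixed_sampling_probs(u1, u2, sample_size=100)
keep = rng.random(1000) < p                       # rows retained for the smaller Huber problem
```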

In one or more embodiments, the general structure of the arguments is to show, as discussed above, that for a fixed x∈ℝ^(d) there are bounds on:

-   contraction, so that with high probability, ∥SAx∥_(G,w) is not too much smaller than ∥Ax∥_(G); and
-   dilation, so that with constant probability, ∥SAx∥_(G,w) is not too much bigger than ∥Ax∥_(G).

The asymmetry in probabilities means that some results are out of reach, but still allows approximation processing by the regression process 350 for min_(x)∥Ax−b∥_(G) (the distinction is blurred between applying S to A for vectors x∈ℝ^(d), and to [A b] for vectors [x −1]). If the optimum x_(OPT) for the original problem has ∥S(Ax_(OPT)−b)∥_(G) that is not too large, then it will be a not-too-large solution for the sketched problem min_(x)∥S(Ax−b)∥_(G,w). If contraction bounds hold with high probability for a fixed vector Ax, then a standard argument shows that they hold for all x; thus, there will be no x that gives a good, small ∥S(Ax−b)∥_(G,w) and a bad, large ∥Ax−b∥_(G).

In one embodiment, the contraction and dilation bounds are shown on a fixed vector y∈ℝ^(n) by the vector processing module 330 splitting up the coordinates of y into groups ("weight classes") of roughly equal magnitude. In one embodiment, a weight class W is then analyzed by the regression process 350 with respect to its cardinality: there will be some random subset ("level") L_(ĥ) for which |W∩L_(ĥ)| is small relative to the number of rows of S_(ĥ) (each row of S_(ĥ) corresponds to a bucket, as an implementation of COUNT-SKETCH), and therefore the members of W are spread out from each other, in separate buckets. This implies that each member of W makes its own independent contribution to ∥Sy∥_(G,w), and therefore that ∥Sy∥_(G,w) will not be too small. Also, the level L_(ĥ) is chosen by the regression process 350 such that the expected number of entries of the weight class is large enough that the random variable |W∩L_(ĥ)| is concentrated around its mean, and hence this contribution from W is well-behaved.

In one embodiment, to show that ∥Sy∥_(G,w) will not be too big, it may be shown that W will not contribute too much to levels other than the "Goldilocks" level L_(ĥ): for h<ĥ, for which |L_(h)∩W| is expected to be large, the fact that members of W∩L_(h) will be crowded together in a single bucket implies they will cancel each other out, roughly speaking; or more precisely, the fact that the COUNT-SKETCH buckets have an expectation that is the l₂ norm of the bucket entries implies that if a bucket contains a large number of entries from one weight class, those entries will make a lower contribution to the estimate ∥Sy∥_(G,w) than they did for L_(ĥ). In one embodiment, for h a bit bigger than ĥ, W∩L_(h) will likely be empty, and W will make no contribution to ∥S_(h)y∥.

One or more embodiments may use the matrix processing module 320, the vector processing module 330 and the sketching module 340 to provide subspace embedding for polynomial kernel maps, which may be further used by the regression process 350. Consider the polynomial kernel of degree q (a positive integer) defined as k(x, y)=(⟨x, y⟩+c)^(q) for some constant c≧0. Without loss of generality it may be assumed henceforth that c=0, since the more general case can be reduced to it by adding a coordinate of value √c to all of the data points. In one example, let φ(x) denote the function that maps a d-dimensional vector x to the d^(q)-dimensional vector formed by taking the product of all subsets of q coordinates of x, and let φ(A) denote the application of φ to the rows of A by the vector processing module 330 and the matrix processing module 320. In one example, φ is the map that corresponds to the polynomial kernel, that is, k(x, y)=⟨φ(x), φ(y)⟩, so learning with the polynomial kernel corresponds to using φ(A) instead of A.
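For intuition only, the explicit map φ can be materialized for tiny d and q and checked against the kernel; this is a sanity-check sketch (the explicit map is never formed in the embodiments, since it has d^(q) coordinates):

```python
import numpy as np
from itertools import product

def phi(x, q):
    """Explicit degree-q polynomial feature map: one coordinate per ordered q-tuple
    (i_1, ..., i_q), equal to x[i_1]*...*x[i_q]; has d^q entries."""
    return np.array([np.prod([x[i] for i in idx]) for idx in product(range(len(x)), repeat=q)])

rng = np.random.default_rng(0)
x, y, q = rng.standard_normal(4), rng.standard_normal(4), 3
# For c = 0 the kernel identity k(x, y) = <x, y>^q = <phi(x), phi(y)> holds exactly.
assert np.isclose(np.dot(x, y) ** q, np.dot(phi(x, q), phi(y, q)))
```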

In one embodiment, the sketching module 340 designs a distribution over d^(q)×O(3^(q)n²/ε²) sketching matrices S so that the mapping φ(A)·S may be computed in O(nnz(A)q)+poly(3^(q)n/ε) time, where nnz(A) denotes the number of non-zero entries of A. Further, with constant probability arbitrarily close to 1, for all n-dimensional vectors z, ∥z·φ(A)·S∥₂=(1±ε)∥z·φ(A)∥₂, that is, the entire row-space of φ(A) is approximately preserved.

In one embodiment, the sketching module 340 may implement different applications, such as SVM, approximate kernel PCA, PCR, etc. In one example, an n×k matrix V with orthonormal columns spans a rank-k (1+ε)-approximation of an n×d matrix A if ∥A−VV^(T)A∥_(F)≦(1+ε)∥A−A_(k)∥_(F). In one embodiment, the results from the sketching module 340 for a constant degree q are as follows:

-   For polynomial kernel SVM, when d>>n, existing algorithms would either take O(dn²) time to form the n×n Gram matrix φ(A)φ(A)^(T), or at least d^(q) time to compute φ(A)S for a sketching matrix S. Once these are computed, the SVM can be solved in poly(n/ε) time using standard methods. In contrast, in one embodiment the process implemented by the sketching module 340 takes O(nnz(A))+poly(n/ε) time, which improves the leading order term from O(dn²) to O(nnz(A)), the latter always being at most O(nd) but possibly significantly smaller for sparse matrices.
-   In O(nnz(A))+n·poly(k/ε) time the matrix processing module 320 computes an n×k matrix V with orthonormal columns for which ∥φ(A)−VV^(T)φ(A)∥_(F)≦(1+ε)∥φ(A)−[φ(A)]_(k)∥_(F), where [φ(A)]_(k) denotes the best rank-k approximation to φ(A). The k-dimensional subspace V of ℝ^(n) can be thought of as an approximation to the top k left singular vectors of φ(A).
-   Once the sketching module 340 determines the subspace V, a low rank approximation to φ(A) is obtained. To do this, in one embodiment the sketching module 340 samples a subset R of poly(k/ε) rows of φ(A) according to the leverage scores of V, obtaining a (1+ε) rank-k approximation to φ(A) given by V·U·R, where U is a k×poly(k/ε) matrix. Here the row space of R contains a good approximation to the span of the top k principal components of φ(A). In one example, it is desired to represent this matrix implicitly, since otherwise R would have d^(q) columns. This is possible by storing the sampled rows A_(i) rather than the sampled rows φ(A_(i)). In one embodiment, the sketching module 340 obtains three matrices V, U, and R from the matrix processing module 320, for which ∥φ(A)−V·U·φ(R)∥_(F)≦(1+ε)∥φ(A)−[φ(A)]_(k)∥_(F). This representation is useful, since given a point y∈ℝ^(d), the sketching module 340 can compute φ(R)·φ(y) quickly using the kernel trick. The total time to compute the low rank approximation is O(nnz(A))+(n+d)·poly(k/ε). This is considerably faster than standard kernel PCA, which would need O(n²d) time to compute the Gram matrix φ(A)·φ(A)^(T).
-   The subspace V can be used to regularize and speed up various learning processes with the polynomial kernel. In one example, using the regression process 350, the subspace V may be used to solve regression problems of the form min_(x)∥Vx−b∥₂, an approximate form of principal component regression. This can serve as a form of regularization, which is required as the problem min_(x)∥φ(A)x−b∥₂ is underdetermined. A popular alternative form of regularization is to use kernel ridge regression, which requires O(n²d) operations. As nnz(A)≦nd, one or more embodiments using the regression process 350 are faster.

For sketching polynomial kernels, the TensorSketch algorithm is well-known. TensorSketch combines the earlier CountSketch with a fast Fourier transform (FFT) in a clever way. What was not known about TensorSketch, which is crucial in order to use it as an OSE for applications with one or more embodiments, is whether it can be used to approximately preserve entire subspaces of points. Indeed, a fixed point v∈ℝ^(d) has the property that for the TensorSketch sketching matrix S, ∥φ(v)·S∥₂=(1±ε)∥φ(v)∥₂ with constant probability. To obtain a high probability bound using their results, a median of several independent sketches may be taken. Given a high probability bound (1−exp(−n)), one can use a net argument to show that the sketch is correct for all vectors v in an n-dimensional subspace of ℝ^(d). The median operation, however, results in a non-convex embedding, and it is not clear how to efficiently solve optimization problems in the sketch space with such an embedding. Moreover, the fact that n independent sketches are needed to achieve probability 1−exp(−n) means the running time will be at least n·nnz(A), whereas one or more embodiments only seek O(nnz(A)) time.

Conventional analyses show that CountSketch can be used to provide a subspace embedding, that is, simultaneously for all v∈V, ∥φ(v)·S∥₂=(1±ε)∥φ(v)∥₂. TensorSketch can be seen as a very restricted form of CountSketch, where the additional restrictions enable its fast running time on inputs which are tensor products. In particular, the hash functions in TensorSketch are only 3-wise independent. CountSketch still provides a subspace embedding if the entries are chosen from a 4-wise independent distribution.

It should be noted that previous work on sketching the polynomial kernel suffers from the drawback described above, that is, it provides no provable guarantees for preserving an entire subspace, which is needed, e.g., for low rank approximation. This is true even of the sketching methods for polynomial kernels that do not use TensorSketch, as such work only provides tail bounds for preserving the norm of a fixed vector, and has the aforementioned problems of extending to a subspace, i.e., reducing the probability of error enough to union bound over net vectors in a subspace would require increasing the running time by a factor equal to the dimension of the subspace.

An unusual aspect is that for a TensorSketch matrix S, one can compute φ(A)·S very efficiently. However, S·φ(A) is not known to be efficiently computable, and indeed, for degree-2 polynomial kernels this can be shown to be as hard as general rectangular matrix multiplication. In general, even writing down S·φ(A) would take a prohibitive d^(q) amount of time. Therefore, in one embodiment, the sketching module 340 is designed to only sketch on one side of φ(A).

The goal in random feature maps is to construct randomized feature maps Ψ(•) so that the Euclidean inner product <Ψ(u),Ψ(v)> closely approximates the value of k(u, v), where k is the kernel; the mapping Ψ is dependent on the kernel. Theoretical analysis in this line of research has focused so far on showing that <Ψ(u),Ψ(v)> is indeed close to k(u, v). This is also the kind of approach that is used to analyze TensorSketch. The problem with this kind of analysis is that it is hard to relate it to downstream metrics like generalization error. In contrast, in one embodiment the sketching module 340 based on OSEs provides a computational framework for analyzing the mappings, to reason about their downstream use, and to utilize various tools from numerical linear algebra in conjunction with them.

In one example, the CountSketch data structure is described as follows. Let B be the dimension of CountSketch, which is the number of its hash buckets. When applied to d-dimensional vectors v, the data structure is specified by a 2-wise independent hash function h: [d]→[B] and a 2-wise independent sign function s: [d]→{−1, +1}. The value in the i-th bucket of the data structure, for i=1, 2, . . . , B, is given by Σ_(j|h(j)=i) s(j)v_(j). CountSketch can be represented as a B×d matrix in which the j-th column contains a single non-zero entry s(j) in the h(j)-th row.
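A minimal Python rendering of this data structure follows; the fully random hash and sign assignments below stand in for the 2-wise independent families in the text, and the sizes are arbitrary:

```python
import numpy as np

def countsketch_matrix(d, B, rng):
    """CountSketch as a B x d matrix: column j holds a single nonzero s(j) in row h(j)."""
    h = rng.integers(0, B, size=d)        # bucket assignment h: [d] -> [B]
    s = rng.choice([-1.0, 1.0], size=d)   # sign function s: [d] -> {-1, +1}
    S = np.zeros((B, d))
    S[h, np.arange(d)] = s
    return S

rng = np.random.default_rng(2)
v = rng.standard_normal(100)
S = countsketch_matrix(100, B=20, rng=rng)
print(S @ v)    # bucket i holds sum_{j: h(j)=i} s(j) * v_j
```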

The TensorSketch data structure is described as follows. Suppose a point v∈ℝ^(d), and so φ(v)∈ℝ^(d^(q)), is given. In one example, q 3-wise independent hash functions h₁, . . . , h_(q): [d]→[B], and q 2-wise independent sign functions s₁, . . . , s_(q): [d]→{+1, −1}, are chosen. The hash function H: [d^(q)]→[B] and sign function S: [d^(q)]→{+1, −1} are defined as follows:

H(i₁, . . . , i_(q))=h₁(i₁)+h₂(i₂)+ . . . +h_(q)(i_(q)) mod B, and

S(i₁, . . . , i_(q))=s₁(i₁)·s₂(i₂) . . . s_(q)(i_(q)).

It is well-known that if H is constructed this way, then it is 3-wise independent. Unlike a conventional technique, which implemented H as a 2-wise independent hash function, in the following a stronger property of H is needed. To hash the point v, TensorSketch computes the polynomials

p_(l)(x)=Σ_(i=0)^(B−1) x^(i) Σ_(j|h_(l)(j)=i) v_(j)·s_(l)(j), for l=1, 2, . . . , q.

A calculation shows

Π_(l=1)^(q) p_(l)(x) mod (x^(B)−1)=Σ_(i=0)^(B−1) x^(i) Σ_((j₁, . . . , j_(q))|H(j₁, . . . , j_(q))=i) v_(j₁) . . . v_(j_(q)) S(j₁, . . . , j_(q)),

that is, the product of the q polynomials can be viewed as a CountSketch data structure applied to the vector v^(⊗q) (the q-fold tensor product of v with itself) with hash function H and sign function S.

A key insight of Pagh is that this product of polynomials can be computed in O(qB log B) time using the Fast Fourier Transform. As it takes O(q·nnz(v)) time to form the q polynomials, the overall time to compute TensorSketch(v) is O(q(nnz(v)+B log B)). While an error analysis of TensorSketch is provided in the literature, it is only for a single point, and does not apply to properties of subspaces. In one or more embodiments, a new analysis of TensorSketch is provided for subspaces.
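The following Python sketch assembles these pieces into a TensorSketch of φ(v) computed via the FFT, without ever forming the d^(q)-dimensional vector; the pseudo-random hashes used here are a stand-in for the 3-wise and 2-wise independent families above, and the sizes are illustrative:

```python
import numpy as np

def tensorsketch(v, q, B, rng):
    """TensorSketch of phi(v) = v (x) ... (x) v (q times): take q independent CountSketches
    of v, multiply their length-B FFTs pointwise (polynomial product mod x^B - 1),
    and invert the FFT."""
    d = len(v)
    acc = np.ones(B, dtype=complex)
    for _ in range(q):
        h = rng.integers(0, B, size=d)
        s = rng.choice([-1.0, 1.0], size=d)
        cs = np.zeros(B)
        np.add.at(cs, h, s * v)            # coefficients of p_l: a CountSketch of v
        acc *= np.fft.fft(cs)
    return np.real(np.fft.ifft(acc))

rng = np.random.default_rng(3)
v = rng.standard_normal(50)
sk = tensorsketch(v, q=2, B=500, rng=rng)
print(np.linalg.norm(sk), np.linalg.norm(v) ** 2)   # ||sketch||_2 roughly matches ||phi(v)||_2 = ||v||_2^q
```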

In one example embodiment, for OSE, let Π be the m×d^(q) matrix such that TensorSketch(v) is φ(v)Π^(T), where φ(v) is the row feature vector in d^(q) dimensions. It will be shown that Π^(T) is an oblivious subspace embedding for subspaces of ℝ^(d^(q)) for appropriate values of m. Notice that Π has exactly one non-zero entry per column. The index of the non-zero entry in the column (i₁, . . . , i_(q)) is H(i₁, . . . , i_(q))=Σ_(j=1)^(q) h_(j)(i_(j)) mod m. Let δ_(i,j) be the indicator random variable of whether Π_(i,j) is non-zero. The sign of the non-zero entry in column (i₁, . . . , i_(q)) is S(i₁, . . . , i_(q))=Π_(j=1)^(q) s_(j)(i_(j)). The following lemmas show that Π can be used to approximate a matrix product and is a subspace embedding, according to one embodiment. Let A and B be matrices with d^(q) rows. For m=Θ(3^(q)/(ε²δ)):

Pr[∥A^(T)Π^(T)ΠB−A^(T)B∥_(F)²≦ε²∥A∥_(F)²∥B∥_(F)²]≧1−δ.

Proof. Let C=A^(T)Π^(T)ΠB. Then

C_(u,u′)=Σ_(t=1)^(m) Σ_(i,j∈[d]^(q)) S(i)S(j)δ_(t,i)δ_(t,j)A_(i,u)B_(j,u′)=Σ_(t=1)^(m) Σ_(i≠j∈[d]^(q)) S(i)S(j)δ_(t,i)δ_(t,j)A_(i,u)B_(j,u′)+(A^(T)B)_(u,u′).

Thus, E[C_(u,u′)]=(A^(T)B)_(u,u′). Next, E[((C−A^(T)B)_(u,u′))²] is analyzed. This leads to:

((C−A^(T)B)_(u,u′))²=Σ_(t₁,t₂=1)^(m) Σ_(i₁≠j₁, i₂≠j₂∈[d]^(q)) S(i₁)S(i₂)S(j₁)S(j₂)·δ_(t₁,i₁)δ_(t₁,j₁)δ_(t₂,i₂)δ_(t₂,j₂)·A_(i₁,u)A_(i₂,u)B_(j₁,u′)B_(j₂,u′).

For a term in the summation on the right hand side to contribute to E[((C−A^(T)B)_(u,u′))²], it must be the case that E[S(i₁)S(i₂)S(j₁)S(j₂)]=1. In other words, in each of the q coordinates, the 4 coordinates of i₁, i₂, j₁, j₂ must form matching pairs. In one example embodiment, let S₁ be the set of coordinates where i₁ and i₂ agree. Note that j₁ and j₂ must also agree in all coordinates in S₁. Let S₂⊂[q]\S₁ be the set of remaining coordinates where i₁ and j₁ agree. Finally, let S₃=[q]\(S₁∪S₂). All coordinates in S₃ of i₁ and j₂ must agree. One can write i₁=(a, b, c), i₂=(a, e, f), j₁=(g, b, f), j₂=(g, e, c), where a, g∈[d]^(|S₁|), b, e∈[d]^(|S₂|), c, f∈[d]^(|S₃|).

First it is shown that the contribution of the terms where i₁=i₂ or i₁=j₂ is bounded by

$\frac{2\| A_{u}\|_{2}^{2}\| B_{u^{\prime}}\|_{2}^{2}}{m},$

where A_(u) is the u-th column of A and B_(u′) is the u′-th column of B. Indeed, consider the case i₁=i₂. As observed before, j₁=j₂ must hold to get a non-zero contribution. Note that if t₁≠t₂, one always has δ_(t₁,i₁)δ_(t₂,i₂)=0, as H(i₁) cannot be equal to both t₁ and t₂. Thus, for fixed i₁=i₂ and j₁=j₂,

$\mathbb{E}\left\lbrack {\sum\limits_{t_{1},t_{2} = 1}^{m}{S\left( i_{1} \right)S\left( i_{2} \right)S\left( j_{1} \right)S\left( j_{2} \right)\,\delta_{t_{1},i_{1}}\delta_{t_{1},j_{1}}\delta_{t_{2},i_{2}}\delta_{t_{2},j_{2}}\, A_{i_{1},u}A_{i_{2},u}B_{j_{1},u^{\prime}}B_{j_{2},u^{\prime}}}} \right\rbrack = {\mathbb{E}\left\lbrack {\sum\limits_{t_{1} = 1}^{m}{\delta_{t_{1},i_{1}}^{2}\delta_{t_{1},j_{1}}^{2}A_{i_{1},u}^{2}B_{j_{1},u^{\prime}}^{2}}} \right\rbrack = {\frac{A_{i_{1},u}^{2}B_{j_{1},u^{\prime}}^{2}}{m}.}}$

Summing over all possible values of i₁ and j₁ yields a bound of

$\frac{\| A_{u}\|_{2}^{2}\| B_{u^{\prime}}\|_{2}^{2}}{m}$

for this case. The case i₁=j₂ is analogous.

In one embodiment, the contribution of the terms where i₁≠i₂, j₁, j₂, i.e., where there are at least three distinct values among i₁, i₂, j₁, j₂, is computed by the sketching module 340. Notice that

$\mathbb{E}\left\lbrack {\delta_{t_{1},i_{1}}\delta_{t_{1},j_{1}}\delta_{t_{2},i_{2}}\delta_{t_{2},j_{2}}} \right\rbrack \leq {\frac{1}{m^{3}}.}$

For fixed i₁, j₁, i₂, j₂, there are m² choices of t₁, t₂, so the total contribution to the expectation from terms with the same i₁, j₁, i₂, j₂ is bounded by

${m^{2} \cdot \frac{1}{m^{3}} \cdot {A_{i_{1},u}A_{i_{2},u}B_{j_{1},u^{\prime}}B_{j_{2},u^{\prime}}}} = {\frac{1}{m}{A_{i_{1},u}A_{i_{2},u}B_{j_{1},u^{\prime}}B_{j_{2},u^{\prime}}}.}$

Therefore,

$\mathbb{E}\left\lbrack \left( \left( {C - {A^{T}B}} \right)_{u,u^{\prime}} \right)^{2} \right\rbrack \leq {\frac{2\| A_{u}\|_{2}^{2}\| B_{u^{\prime}}\|_{2}^{2}}{m} + {\frac{1}{m}{\sum\limits_{\text{partitions}\; S_{1},S_{2},S_{3}}\mspace{11mu}{\sum\limits_{a,{g \in {\lbrack d\rbrack}^{|S_{1}|}},b,{e \in {\lbrack d\rbrack}^{|S_{2}|}},c,{f \in {\lbrack d\rbrack}^{|S_{3}|}}}{{A_{{({a,b,c})},u}B_{{({g,b,f})},u^{\prime}}A_{{({a,e,f})},u}B_{{({g,e,c})},u^{\prime}}}}}}}} \leq {\frac{2\| A_{u}\|_{2}^{2}\| B_{u^{\prime}}\|_{2}^{2}}{m} + {\frac{3^{q}}{m}{\sum\limits_{a,b,c,g,e,f}{{A_{{({a,b,c})},u}B_{{({g,b,f})},u^{\prime}}A_{{({a,e,f})},u}B_{{({g,e,c})},u^{\prime}}}}}}} \leq {\frac{2\| A_{u}\|_{2}^{2}\| B_{u^{\prime}}\|_{2}^{2}}{m} + {\frac{3^{q}}{m}{\sum\limits_{g,e,f}{\left( {\sum\limits_{a,b,c}A_{{({a,b,c})},u}^{2}} \right)^{1/2}\left( {\sum\limits_{a,b,c}{B_{{({g,b,f})},u^{\prime}}^{2}A_{{({a,e,f})},u}^{2}B_{{({g,e,c})},u^{\prime}}^{2}}} \right)^{1/2}}}}} = {\frac{2\| A_{u}\|_{2}^{2}\| B_{u^{\prime}}\|_{2}^{2}}{m} + {\frac{3^{q}\| A_{u}\|_{2}}{m}{\sum\limits_{g,e,f}{\left( {\sum\limits_{b}B_{{({g,b,f})},u^{\prime}}^{2}} \right)^{1/2}\left( {\sum\limits_{a,c}{A_{{({a,e,f})},u}^{2}B_{{({g,e,c})},u^{\prime}}^{2}}} \right)^{1/2}}}}},$

where the second inequality follows from the fact that there are 3^(q) partitions of [q] into 3 sets; the other inequalities are from Cauchy-Schwarz. Continuing to bound this expression,

${{\leq {\frac{2\| A_{u}\|_{2}^{2}\| B_{u^{\prime}}\|_{2}^{2}}{m} + {\frac{3^{q}\| A_{u}\|_{2}}{m}{\sum\limits_{e}{\left( {\sum\limits_{b,g,f}B_{{({g,b,f})},u^{\prime}}^{2}} \right)^{1/2}\left( {\sum\limits_{a,c,g,f}{A_{{({a,e,f})},u}^{2}B_{{({g,e,c})},u^{\prime}}^{2}}} \right)^{1/2}}}}}} = {{{\frac{2\| A_{u}\|_{2}^{2}\| B_{u^{\prime}}\|_{2}^{2}}{m} + {\frac{3^{q}\| A_{u}\|_{2}\| B_{u^{\prime}}\|_{2}}{m}{\sum\limits_{e}{\left( {\sum\limits_{a,f}A_{{({a,e,f})},u}^{2}} \right)^{1/2}\left( {\sum\limits_{g,c}B_{{({g,e,c})},u^{\prime}}^{2}} \right)^{1/2}}}}} \leq {\frac{2\| A_{u}\|_{2}^{2}\| B_{u^{\prime}}\|_{2}^{2}}{m} + {\frac{3^{q}\| A_{u}\|_{2}\| B_{u^{\prime}}\|_{2}}{m}\left( {\sum\limits_{a,e,f}A_{{({a,e,f})},u}^{2}} \right)^{1/2}\left( {\sum\limits_{g,e,c}B_{{({g,e,c})},u^{\prime}}^{2}} \right)^{1/2}}}} = \frac{\left( {2 + 3^{q}} \right)\| A_{u}\|_{2}^{2}\| B_{u^{\prime}}\|_{2}^{2}}{m}}},$

where the inequalities are applications of the Cauchy-Schwarz inequality.

Combining the above bounds, one has

$\mathbb{E}\left\lbrack \left( \left( {C - {A^{T}B}} \right)_{u,u^{\prime}} \right)^{2} \right\rbrack \leq {\frac{\left( {2 + 3^{q}} \right)\| A_{u}\|_{2}^{2}\| B_{u^{\prime}}\|_{2}^{2}}{m}.}$

For m≧(2+3^(q))/(ε²δ), by the Markov inequality, ∥A^(T)Π^(T)ΠB−A^(T)B∥_(F)²≦ε²∥A∥_(F)²∥B∥_(F)² with probability 1−δ. Consider a fixed k-dimensional subspace V. If m=Θ(k²3^(q)/(ε²δ)), then with probability at least 1−δ, ∥Πx∥=(1±ε)∥x∥ simultaneously for all x∈V. Let B be a d^(q)×k matrix whose columns form an orthonormal basis of V. Thus, one has B^(T)B=I_(k) and ∥B∥_(F)²=k. The condition that ∥Πx∥=(1±ε)∥x∥ simultaneously for all x∈V is equivalent to the condition that the singular values of ΠB are bounded by 1±ε. For m=Θ(3^(q)/((ε/k)²δ)), with probability at least 1−δ, one has

∥B^(T)Π^(T)ΠB−B^(T)B∥_(F)²≦(ε/k)²∥B∥_(F)⁴=ε².

Thus, ∥B^(T)Π^(T)ΠB−I_(k)∥≦∥B^(T)Π^(T)ΠB−I_(k)∥_(F)≦ε. In other words, the squared singular values of ΠB are bounded by 1±ε, implying that the singular values of ΠB are also bounded by 1±ε.

The Support Vector Machines (SVM) problem is to find a hyperplane that maximizes the 1-norm soft margin of a set of n input points A₁, . . . , A_(n)∈ℝ^(1×d) with respective labels y_(i)∈{−1, 1}. In one example embodiment, first it is stated what is known for the non-kernelized version of the problem, and then the results for the polynomial kernel version of the problem are stated. In one embodiment, the sketching module 340 assumes d>>n. In the non-kernelized version of the problem, if A∈ℝ^(n×d) is the matrix whose rows are the A_(i), Y∈ℝ^(n×n) is the diagonal matrix with entries Y_(i,i)=y_(i), and α∈ℝ^(n) is a vector of Lagrange multipliers, then the dual formulation of the SVM optimization problem is to find α given by the following optimization problem:

$\max\limits_{\alpha}\;{1^{T}\alpha} - {\frac{1}{2}\alpha^{T}{YAA}^{T}Y\alpha}\quad\text{subject to}\;{1^{T}Y\alpha = 0}\;\text{and}\;{0 \leq \alpha \leq C},$

where 1^(T), 0, and C are vectors with the implied constant entry.

In one example embodiment, let α* be an optimal solution of the above problem. The optimal separating hyperplane w* is given by w*=A^(T)Yα*=Σ_(i=1)^(n) y_(i)α_(i)*A_(i), and the points A_(i) for which α_(i)*>0 are called the support vectors. The geometric margin γ* is 1/∥w*∥₂, where it can be shown for this optimal hyperplane that ∥w*∥₂²=Σ_(i=1)^(n) α_(i)*. The data radius is given by r=min_(x)max_(A_(i))∥A_(i)−x∥₂. The generalization performance of an SVM is typically characterized by the ratio r/γ*. The smaller the ratio, the better.

Let ε be an accuracy parameter, and suppose S is a subspace embedding for the rowspace of A, that is, for all z∈ℝ^(n), ∥zAS∥₂=(1±ε)∥zA∥₂. Suppose one replaces the matrix A in the SVM optimization problem with Ã=AS. Then if γ̃* is the margin obtained using Ã and r̃ is the radius obtained using Ã, then (1) γ̃*≧(1−ε)·γ*, and (2) r̃≦(1+ε)r. One can compute Ã=AS in O(nnz(A)) time, and the number of columns of S is poly(n/ε). It follows that Ã is an n×poly(n/ε) matrix, so the resulting SVM problem no longer depends on d, and can be solved in poly(n/ε) time using standard methods. Hence, the overall time is O(nnz(A))+poly(n/ε), and the guarantee is that γ̃*≧(1−ε)·γ* and r̃≦(1+ε)r.
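A brief illustration of this reduction follows, assuming scikit-learn's LinearSVC as the downstream solver; the synthetic data, the sketch width m, and the label rule are arbitrary choices for the example, and a CountSketch is used as the embedding S:

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(4)
n, d, m = 200, 5000, 400                       # d >> n; m is an illustrative sketch width
A = rng.standard_normal((n, d))
y = np.sign(A[:, 0] + 0.1 * rng.standard_normal(n))

h = rng.integers(0, m, size=d)                 # CountSketch S in R^{d x m}: row j has one entry
s = rng.choice([-1.0, 1.0], size=d)            # s(j) in column h(j)
A_tilde = np.zeros((n, m))
for j in range(d):                             # A_tilde = A @ S, formed without materializing S
    A_tilde[:, h[j]] += s[j] * A[:, j]

clf = LinearSVC().fit(A_tilde, y)              # solve the SVM on the n x m sketched problem
print(clf.score(A_tilde, y))
```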

The difference with the polynomial kernel version of the problem is that A is replaced with φ(A) in the optimization problem, increasing the dimension to d^(q). Note that α is still n-dimensional. Explicitly writing down φ(A) and then computing φ(A)S for a sketching matrix S would take at least d^(q) time, which is intractable. One can improve this to O(n²d log q) time by using the kernel trick to compute φ(A)φ(A)^(T). One could then solve the resulting optimization problem in poly(n/ε) time. However, if d>>n, this can be quite slow. If A is sparse, then sparsity can potentially be used to compute φ(A)φ(A)^(T) faster, but it is hard to give a formal statement regarding the run-time, as it depends on the sparsity structure (e.g., a single dense column in A will cause φ(A)φ(A)^(T) to be full).

In one embodiment, the following is provided for polynomial kernel SVM. In O(nnz(A)q)+poly(3^(q)n/ε) time, one can solve the SVM optimization problem on the polynomial kernel of degree q so as to obtain a hyperplane with margin γ̃*≧(1−ε)·γ* and radius r̃≦(1+ε)r.

Proof.

S is chosen to be a d^(q)×O(3^(q)n²/ε²) TensorSketch matrix. Using the fact that S provides a subspace embedding, it follows that (1) the stated margin and radius guarantees hold, and (2) φ(A)S can be computed in O(q(nnz(A)+3^(q)n³/ε² log(n/ε))) time, and S has O(3^(q)n²/ε²) columns, according to one embodiment.

In one embodiment, the sketching module 340 provides approximate kernel PCA and low rank approximation. In one example, an n×k matrix V with orthonormal columns spans a rank-k (1+ε)-approximation of an n×d matrix A if ∥A−VV^(T)A∥_(F)≦(1+ε)∥A−A_(k)∥_(F). One example starts by developing a process for finding an n×k matrix V which spans a rank-k (1+ε)-approximation of φ(A). In one embodiment, a process of the sketching module 340 is given as follows:

Process 1 k-Space:

1. Input: A∈ℝ^(n×d), ε∈(0,1], integer k.

2. Output: V∈ℝ^(n×k) with orthonormal columns which spans a rank-k (1+ε)-approximation to φ(A).

3. Set the parameters m=O(3^(q)k²+k/ε) and r=O(3^(q)m²/ε²).

4. Let S be a d^(q)×m TensorSketch and T be an independent d^(q)×r TensorSketch.

5. Compute φ(A) S and φ(A) T.

6. Let U be an orthonormal basis for the column space of φ(A)·S.

7. Let W be the m×k matrix containing the top k left singular vectors of U^(T)φ(A)T.

8. Output V=UW.
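An illustrative Python rendering of Process 1 follows. It applies an independent TensorSketch to every row of A (so that φ(A)S and φ(A)T are formed without materializing φ(A)), and it takes m and r as explicit inputs rather than setting them from the O(·) expressions above; the hashing is pseudo-random rather than k-wise independent, and all sizes in the example are arbitrary:

```python
import numpy as np

def tensorsketch_rows(A, q, width, rng):
    """Compute phi(A) @ S for a fresh TensorSketch S of the given width, row by row,
    without forming the d^q-dimensional phi(A)."""
    n, d = A.shape
    acc = np.ones((n, width), dtype=complex)
    for _ in range(q):
        h = rng.integers(0, width, size=d)
        s = rng.choice([-1.0, 1.0], size=d)
        cs = np.zeros((n, width))
        for j in range(d):
            cs[:, h[j]] += s[j] * A[:, j]      # CountSketch of each row of A
        acc *= np.fft.fft(cs, axis=1)
    return np.real(np.fft.ifft(acc, axis=1))

def k_space(A, q, k, m, r, rng):
    PhiS = tensorsketch_rows(A, q, m, rng)                    # step 5: phi(A) S
    PhiT = tensorsketch_rows(A, q, r, rng)                    # step 5: phi(A) T (independent sketch)
    U, _ = np.linalg.qr(PhiS)                                 # step 6: orthonormal basis of column space
    W, _, _ = np.linalg.svd(U.T @ PhiT, full_matrices=False)  # step 7: left singular vectors
    return U @ W[:, :k]                                       # step 8: V = U W

rng = np.random.default_rng(5)
V = k_space(rng.standard_normal((80, 10)), q=2, k=5, m=60, r=120, rng=rng)
print(V.shape)   # (80, 5)
```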

Before proving the correctness of the process 1 above, two key lemmas are provided. Let S∈ℝ^(d^(q)×m) be a randomly chosen TensorSketch matrix with m=O(3^(q)k²+k/ε). Let UU^(T) be the n×n projection matrix onto the column space of φ(A)·S. Then if [U^(T)φ(A)]_(k) is the best rank-k approximation to the matrix U^(T)φ(A), one has:

∥U[U^(T)φ(A)]_(k)−φ(A)∥_(F)≦(1+O(ε))∥φ(A)−[φ(A)]_(k)∥_(F).

Let UU^(T) be as above. Let T∈ℝ^(d^(q)×r) be a randomly chosen TensorSketch matrix with r=O(3^(q)m²/ε²), where m=O(3^(q)k²+k/ε). Suppose W is the m×k matrix whose columns are the top k left singular vectors of U^(T)φ(A)T. Then,

∥UWW^(T)U^(T)φ(A)−φ(A)∥_(F)≦(1+ε)∥φ(A)−[φ(A)]_(k)∥_(F).

For the polynomial kernel of degree q, in O(nnz(A)q)+n·poly(3^(q)k/ε) time, the process 1 k-Space finds an n×k matrix V which spans a rank-k (1+ε)-approximation of φ(A). The output V=UW spans a rank-k (1+ε)-approximation to φ(A). It only remains to argue the time complexity. The sketches φ(A)·S and φ(A)·T can be computed in O(nnz(A)q)+n·poly(3^(q)k/ε) time by the sketching module 340. In n·poly(3^(q)k/ε) time, the matrix U can be obtained from φ(A)·S and the product U^(T)φ(A)T can be computed. Given U^(T)φ(A)T, the matrix W of top k left singular vectors can be computed in poly(3^(q)k/ε) time, and in n·poly(3^(q)k/ε) time the product V=UW can be computed. Hence the overall time is O(nnz(A)q)+n·poly(3^(q)k/ε).

In one embodiment, a low rank approximation to φ(A) may be found as follows (Polynomial Kernel PCA and Low Rank Factorization). For the polynomial kernel of degree q, in O(nnz(A)q)+(n+d)·poly(3^(q)k/ε) time, the sketching module 340 can find an n×k matrix V, a k×(k/ε) matrix U, and a (k/ε)×d matrix R for which ∥V·U·φ(R)−φ(A)∥_(F)≦(1+ε)∥φ(A)−[φ(A)]_(k)∥_(F). The success probability of the algorithm is at least 0.6. Note that it may be implied that the row space of φ(R) contains a k-dimensional subspace L with d^(q)×d^(q) projection matrix LL^(T) for which ∥φ(A)LL^(T)−φ(A)∥_(F)≦(1+ε)∥φ(A)−[φ(A)]_(k)∥_(F), that is, L provides an approximation to the space spanned by the top k principal components of φ(A).

In one embodiment, regularizing learning with the polynomial kernel may be provided by the sketching module 340. Consider learning with the polynomial kernel. If d<<n it might be that even for low values of q one will have d^(q)>>n. This makes a number of learning algorithms underdetermined, and increases the chance of overfitting. The problem is even more severe if the input matrix A has a lot of redundancy in it (noisy features). To address this, many learning algorithms add a regularizer, e.g., ridge terms. In one embodiment, the sketching module 340 regularizes by using rank-k approximations to the matrix (where k is the regularization parameter that is controlled by the user).

In one embodiment, two different processes that can be regularized using the sketching module 340 are provided. In one example embodiment, approximate kernel PCR is provided. If d^(q)>n the linear regression with φ(A) becomes underdetermined and exact fitting to the right hand side is possible, and in more than one way. One form of regularization is PCR, which first uses PCA to project the data on the principal components, and then continues with linear regression in this space. In one example embodiment, the following approximate version of PCR is provided. In the approximate PCR problem (Approximate PCR), an n×d matrix A and an n×1 vector b are provided, and the goal is to find a vector x ∈ ℝ^(k) and an n×k matrix V with orthonormal columns spanning a rank-k (1+ε)-approximation to A for which x=argmin_(x)∥Vx−b∥₂. Notice that if A is a rank-k matrix, then Approximate PCR coincides with ordinary least squares regression with respect to the column space of A. While PCR would require solving the regression problem with respect to the top k singular vectors of A, in general finding these k vectors exactly results in unstable computation, and they cannot be found by an efficient linear sketch. This would occur, e.g., if the k-th singular value σ_(k) of A is very close (or equal) to σ_(k+1). In one example embodiment, the definition is relaxed to only require that the regression problem be solved with respect to some k vectors which span a rank-k (1+ε)-approximation to A.

In one embodiment, the following is provided for Approximate PCR (Polynomial Kernel Approximate PCR). For the polynomial kernel of degree q, in O(nnz(A)q)+n·poly(3^(q)k/ε) time the sketching module 340 can solve the approximate PCR problem, namely, outputting a vector x ∈ ℝ^(k) and an n×k matrix V with orthonormal columns spanning a rank-k (1+ε)-approximation to φ(A), for which x=argmin_(x)∥Vx−b∥₂. An n×k matrix V with orthonormal columns spanning a rank-k (1+ε)-approximation to φ(A) may be found in O(nnz(A)q)+n·poly(3^(q)k/ε) time. At this point, the regression problem argmin_(x)∥Vx−b∥₂ may be determined by the regression process 350 exactly in O(nk) time since the minimizer is x=V^(T)b.
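A minimal sketch of this step under the same assumptions (numpy and the illustrative k_space helper above): because V has orthonormal columns, the minimizer of ∥Vx−b∥₂ is x=V^(T)b.

    import numpy as np

    def approximate_pcr(A, b, q, k, m, r, seed=0):
        """Illustrative approximate PCR for the polynomial kernel."""
        V = k_space(A, q, k, m, r, seed=seed)  # n x k, orthonormal columns
        x = V.T @ b                            # argmin_x ||Vx - b||_2 since V^T V = I
        return V, x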

In one example embodiment, in canonical correlation analysis (CCA) two matrices A, B are provided and it is desired to find directions in which the spaces spanned by their columns are correlated. A formal linear algebraic definition is as follows. Let A ∈ ℝ^(m×n) and B ∈ ℝ^(m×l), and assume that p=rank(A)≧rank(B)=q. The canonical correlations σ₁(A, B)≧σ₂(A, B)≧ . . . ≧σ_(q)(A, B) of the matrix pair (A, B) are defined recursively by the following formula:

σ_(i)(A, B)=max_{x ∈ A_(i), y ∈ B_(i)} σ(Ax, By)=:σ(Ax_(i), By_(i)), i=1, . . . , q,

where

-   σ(u, v)=|u^(T)v|/(∥u∥₂∥v∥₂),
-   A_(i)={x: Ax≠0, Ax⊥{Ax₁, . . . , Ax_(i−1)}},
-   B_(i)={y: By≠0, By⊥{By₁, . . . , By_(i−1)}}.

The unit vectors Ax₁/∥Ax₁∥₂, . . . , Ax_(q)/∥Ax_(q)∥₂, By₁/∥By₁∥₂, . . . , By_(q)/∥By_(q)∥₂ are referred to as the canonical or principal vectors. The vectors x₁/∥Ax₁∥₂, . . . , x_(q)/∥Ax_(q)∥₂, y₁/∥By₁∥₂, . . . , y_(q)/∥By_(q)∥₂ are referred to as canonical weights (or projection vectors). Note that the canonical weights and the canonical vectors are not uniquely defined.

CCA finds correlations between any parts of the spectrum. Potentially, random correlation between noise might be found, skewing the results. The problem is aggravated in the kernel setting (where correlations are found between φ(A) and φ(B)): if d^(q)>n then there is exact correlation between the subspaces. One common way to address this is to add ridge terms. In one embodiment, the sketching module 340 considers a different form of regularization: finding the correlations between the dominant subspaces of A and B (their principal components). In one example embodiment, an approximate version is introduced. In the Approximate Principal Component CCA Problem (Approximate PC-CCA), given an n×d₁ matrix A and an n×d₂ matrix B, the goal is to find two n×k orthonormal matrices U and V, where U spans a rank-k approximation to A and V spans a rank-k approximation to B, and output the CCA between U and V.

In one embodiment, for the polynomial kernel of degree q, the approximate PC-CCA problem on φ(A) and φ(B) can be solved in O((nnz(A)+nnz(B))q)+n·poly(3^(q)k/ε) time. In one example, an n×k matrix V with orthonormal columns spanning a rank-k (1+ε)-approximation to φ(A) can be found in O(nnz(A)q)+n·poly(3^(q)k/ε) time, and an n×k matrix U with orthonormal columns spanning a rank-k (1+ε)-approximation to φ(B) in O(nnz(B)q)+n·poly(3^(q)k/ε) time. At this point, in one embodiment, the sketching module 340 computes the CCA between U and V, which amounts to computing an SVD of U^(T)V. This takes O(nk²) time.
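A minimal sketch of approximate PC-CCA under the same assumptions (numpy and the illustrative k_space helper; the helper name approximate_pc_cca is hypothetical): since U and V have orthonormal columns, the canonical correlations between them are the singular values of U^(T)V.

    import numpy as np

    def approximate_pc_cca(A, B, q, k, m, r, seed=0):
        """Illustrative approximate PC-CCA between phi(A) and phi(B)."""
        U = k_space(A, q, k, m, r, seed=seed)
        V = k_space(B, q, k, m, r, seed=seed + 7)
        Wu, sigma, WvT = np.linalg.svd(U.T @ V, full_matrices=False)  # SVD of U^T V
        return sigma, U @ Wu, V @ WvT.T  # canonical correlations and canonical vectors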

Two sets of example embodiments using the sketching module 340 have the goal of demonstrating that the k-Space process 1 is useful as a feature extraction process. In one example, standard classification and regression datasets are used. In the first set of example embodiments, ordinary l₂ regression is compared to approximate principal component l₂ regression, where the approximate principal components are extracted using k-Space (RLSC is used for classification). Specifically, as explained above, one embodiment uses k-Space to compute V and then uses regression on V (in one dataset an additional ridge regularization is also added) using the regression process 350. To predict, it is noticed that V=φ(A)·S·R⁻¹·W, where R is the R factor of φ(A)·S, so S·R⁻¹·W defines a mapping to the approximate principal components. To predict on a matrix A_(t), the sketching module first computes φ(A_(t))·S·R⁻¹·W (using TensorSketch to compute φ(A_(t))·S fast) and then multiplies by the coefficients found by the regression process 350. In all the example embodiments, φ(•) is defined using the kernel k(u, v)=(u^(T)v+1)³.

k-Space can be rather expensive both in time and in memory to compute, so in one example embodiment, feature extraction using only a subset of the training set is implemented. Specifically, first the dataset is sampled and stored in the storage module 310, then k-Space is used to compute the mapping S·R⁻¹·W. This mapping is applied to the entire dataset before performing regression by the regression process 350. The results of the first example embodiment are shown in Table 1. Since k-Space is randomized, the mean and standard deviation of five (5) runs are shown in Table 1. For all datasets, learning with the extracted features yields better generalization errors than learning with the original features. Extracting the features using only a sample of the training set results in only slightly worse generalization errors.
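A minimal sketch of this prediction pipeline (assuming numpy and the illustrative tensor_sketch helper; the names fit_feature_map and extract_features are hypothetical): the mapping S·R⁻¹·W is learned once, possibly from a sample of the training set, and is then applied to any matrix, including the test matrix A_(t).

    import numpy as np

    def fit_feature_map(A_sample, q, k, m, r, seed=0):
        """Learn the mapping S R^{-1} W from a (possibly subsampled) training set."""
        PhiA_S = tensor_sketch(A_sample, q, m, seed=seed)      # phi(A) S
        PhiA_T = tensor_sketch(A_sample, q, r, seed=seed + 1)  # phi(A) T
        Q, R = np.linalg.qr(PhiA_S)                            # U = phi(A) S R^{-1}
        Uw, _, _ = np.linalg.svd(Q.T @ PhiA_T, full_matrices=False)
        return R, Uw[:, :k]                                    # R factor and W

    def extract_features(A_new, R, W, q, m, seed=0):
        """Map new rows to phi(A_new) S R^{-1} W; the same seed and m
        reproduce the same sketch S that was used during fitting."""
        PhiAnew_S = tensor_sketch(A_new, q, m, seed=seed)
        return np.linalg.solve(R.T, PhiAnew_S.T).T @ W         # phi(A_new) S R^{-1} W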

A similar setup is used in the second set of example embodiments, now using linear SVM instead of regression (run only on the classification datasets). The results are shown in Table 2. Although the gap is smaller, it may be seen again that generally the extracted features lead to better generalization errors.

For Table 1 below: comparison of testing error using regression with the original features and with features extracted using k-Space. In Table 1, n is the number of training instances, d is the number of features per instance and n_(t) is the number of instances in the test set. “Regression” stands for ordinary l₂ regression. “PCA Regression” stands for approximate principal component l₂ regression. “Sample PCA Regression” stands for approximate principal component l₂ regression where only n_(s) samples from the training set are used for computing the feature extraction. In “PCA Regression” and “Sample PCA Regression” k features are extracted. In k-Space process 1, m=O(k) and r=O(k) are used with the ratio between m and k and between r and k as detailed in Table 1. For classification tasks, the percent of testing points incorrectly predicted is reported. For regression tasks, ∥y_(p)−y∥₂/∥y∥₂ is shown, where y_(p) is the predicted values and y is the ground truth.

TABLE 1

Dataset | Regression | PCA Regression | Sampled PCA Regression
mnist (classification; n = 60,000, d = 784, n_(t) = 10,000) | 14% | Out of Memory | 7.9% ± 0.06% (k = 500, n_(s) = 5000, m/k = 2, r/k = 4)
cpu (regression; n = 6,554, d = 21, n_(t) = 819) | 12% | 4.3% ± 1.0% (k = 200, m/k = 4, r/k = 8) | 3.6% ± 0.1% (k = 200, n_(s) = 2000, m/k = 4, r/k = 8)
adult (classification; n = 32,561, d = 123, n_(t) = 16,281) | 15.3% | 15.2% ± 0.1% (k = 500, m/k = 2, r/k = 4) | 15.2% ± 0.03% (k = 500, n_(s) = 5000, m/k = 2, r/k = 4)
census (regression; n = 18,186, d = 119, n_(t) = 2,273) | 7.1% | 6.5% ± 0.2% (k = 500, m/k = 4, r/k = 8, λ = 0.001) | 6.8% ± 0.4% (k = 500, n_(s) = 5000, m/k = 4, r/k = 8, λ = 0.001)
usps (classification; n = 7,291, d = 256, n_(t) = 2,007) | 13.1% | 7.0% ± 0.2% (k = 200, m/k = 4, r/k = 8) | 7.5% ± 0.3% (k = 200, n_(s) = 2000, m/k = 4, r/k = 8)

For Table 2: comparison of testing error using SVM with the original features and with features extracted using k-Space. In Table 2, n is the number of training instances, d is the number of features per instance and n_(t) is the number of instances in the test set. “SVM” stands for linear SVM. “PCA SVM” stands for using k-Space to extract features, and then using linear SVM. “Sample PCA SVM” stands for the same, where only n_(s) samples from the training set are used for computing the feature extraction. In “PCA SVM” and “Sample PCA SVM” k features are extracted. In k-Space process 1, m=O(k) and r=O(k) are used with the ratio between m and k and between r and k as detailed in Table 2. For classification tasks, the percent of testing points incorrectly predicted is reported.

TABLE 2

Dataset | SVM | PCA SVM | Sampled PCA SVM
mnist (classification; n = 60,000, d = 784, n_(t) = 10,000) | 8.4% | Out of Memory | 6.1% ± 0.1% (k = 500, n_(s) = 5000, m/k = 2, r/k = 4)
adult (classification; n = 32,561, d = 123, n_(t) = 16,281) | 15.0% | 15.1% ± 0.1% (k = 500, m/k = 2, r/k = 4) | 15.2% ± 0.1% (k = 500, n_(s) = 5000, m/k = 2, r/k = 4)
usps (classification; n = 7,291, d = 256, n_(t) = 2,007) | 8.3% | 7.2% ± 0.2% (k = 200, m/k = 4, r/k = 8) | 7.5% ± 0.3% (k = 200, n_(s) = 2000, m/k = 4, r/k = 8)

In one embodiment, it should be noted that it is not a goal to show that the k-Space process 1 is the best feature extraction algorithm for the classification algorithms considered (RLSC and SVM), but rather that it can be used to extract features of higher quality than the original ones. In fact, in the example embodiments, while for a fixed number of extracted features the k-Space process 1 produces better features than simply using TensorSketch, it is also more expensive in terms of time. If that additional time is used to do learning or prediction with TensorSketch with more features, an overall better generalization error may be obtained. However, feature extraction is widely applicable, and there may be cases where having fewer high quality features is beneficial, e.g., performing multiple learnings on the same data, or a very expensive learning task.

FIG. 4 is a block diagram showing a process 400 for sketching for M-estimators for performing regression, in accordance with an embodiment of the invention. In one embodiment, in block 410, one or more sets of input data (e.g., samples/sampling, collected data, etc.) is provided. In one embodiment, in block 420 a matrix A and a vector b are generated using the one or more sets of input data (e.g., using the matrix processing module 320 and vector processing module 330, FIG. 3). In one embodiment, in block 430 a processor device is used for processing the matrix A and the vector b based on a randomized sketching matrix S (e.g., as described in relation to FIG. 3). In one embodiment, in block 440 a vector x is provided that minimizes a normalized measure function based on the matrix A and the vector b. In one embodiment, in block 450 a relationship between the one or more sets of input data is determined based on the vector x.

In one embodiment, process 400 may include determining the vector x based on determining min_(x)∥Ax−b∥_(G), where A ∈ ℝ^(n×d) and b ∈ ℝ^(n), and ∥y∥_(G) for y ∈ ℝ^(n) is specified by the measure function G: ℝ→ℝ⁺, wherein ∥y∥_(G)≡Σ_(i) G(y_(i)), where i, n and d are positive integers, and the matrix A comprises an n×d matrix. In one embodiment, the sketching matrix S is used to estimate ∥Ax∥_(G) for all x ∈ ℝ^(d). In one embodiment, the sketching matrix S is used for a plurality of M-estimators requiring O(nnz(A)+poly(d log n)) time, wherein the processor device makes one pass over the matrix A for determining a regression solution with a residual error within a constant factor of a specified tolerance and with constant probability, wherein nnz comprises a number of non-zero entries of the matrix A.
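As a minimal illustration only (assuming numpy; this is not the M-sketch construction of the embodiments), the following shows one common measure function G for the Huber estimator, the norm ∥y∥_(G)=Σ_(i)G(y_(i)), and the sketch-and-solve pattern with a CountSketch-style S written out for the least-squares case G(y)=y²:

    import numpy as np

    def huber(y, tau=1.0):
        """One common Huber measure function G (quadratic near 0, linear in the tails)."""
        a = np.abs(y)
        return np.where(a <= tau, y ** 2 / (2 * tau), a - tau / 2)

    def norm_G(y, tau=1.0):
        """The normalized measure ||y||_G = sum_i G(y_i)."""
        return huber(y, tau).sum()

    def countsketch(m, n, seed=0):
        """A sparse random sketching matrix S (one +/-1 entry per column);
        applying such an S costs O(nnz(A)) time (applied densely here for brevity)."""
        rng = np.random.default_rng(seed)
        S = np.zeros((m, n))
        S[rng.integers(0, m, size=n), np.arange(n)] = rng.choice([-1.0, 1.0], size=n)
        return S

    # Sketch-and-solve for the least-squares case G(y) = y^2: minimize
    # ||S(Ax - b)||_2 instead of ||Ax - b||_2.
    n, d = 2000, 10
    A = np.random.randn(n, d)
    b = A @ np.random.randn(d) + 0.1 * np.random.randn(n)
    S = countsketch(20 * d, n)
    x_sketch = np.linalg.lstsq(S @ A, S @ b, rcond=None)[0]
    print(norm_G(A @ x_sketch - b))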

In one embodiment, the process 400 may include providing an OSE for embedding a space induced by a nonlinear (polynomial) kernel for the matrix A. In one embodiment, embedding the space comprises applying a map φ(A) to rows of A, wherein the map φ(A) corresponds to the nonlinear kernel. In one embodiment, the process 400 may include determining the vector x based on determining min_(x)∥φ(A)x−b∥_(G), where A ∈ ℝ^(n×d) and b ∈ ℝ^(n), and ∥y∥_(G) for y ∈ ℝ^(n) is specified by the measure function G: ℝ→ℝ⁺, wherein ∥y∥_(G)≡Σ_(i) G(y_(i)), where i, n and d are positive integers, and the matrix A comprises an n×d matrix.

FIG. 5 is a block diagram showing a process 500 for sketching using an OSE for embedding a space induced by a nonlinear kernel, in accordance with an embodiment of the invention. In one embodiment, in block 510, one or more sets of input data (e.g., samples/sampling, collected data, etc.) is provided. In one embodiment, in block 520 a matrix A and a vector b are generated using the one or more sets of input data (e.g., using the matrix processing module 320 and vector processing module 330, FIG. 3). In one embodiment, in block 530, an OSE is provided for applying a map φ(A) to rows of the matrix A, where the map φ corresponds to a nonlinear kernel. In one embodiment, in block 540 a processor device is used for processing the map φ(A) and the vector b based on a randomized sketching matrix S (e.g., as described in relation to FIG. 3). In one embodiment, in block 550 a vector x is provided that minimizes a normalized measure function based on the map φ(A) and the vector b. In one embodiment, in block 560 a relationship between the one or more sets of input data is determined based on the vector x.

In one embodiment, in process 500 determining the vector x comprises determining min_(x)∥φ(A)x−b∥_(G), where A ∈ ℝ^(n×d) and b ∈ ℝ^(n), and ∥y∥_(G) for y ∈ ℝ^(n) is specified by the measure function G: ℝ→ℝ⁺, wherein ∥y∥_(G)≡Σ_(i) G(y_(i)), where i, n and d are positive integers, and the matrix A comprises an n×d matrix. In one embodiment, in process 500 the sketching matrix S is used to estimate ∥φ(A)x∥_(G) for all x ∈ ℝ^(d). In one embodiment, in process 500 the map φ(A) maps each row of A to a higher dimensional space using a polynomial kernel of a given degree.
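As a minimal illustration of block 550 for the least-squares case only (assuming numpy and the illustrative tensor_sketch helper; general measure functions G require the M-estimator machinery described above), one simple option is to restrict the solution to the range of the sketch, x=Sz, so that only the n×m matrix φ(A)S is ever formed:

    import numpy as np

    # Illustrative only: restrict x to the range of the sketch, x = S z, and
    # solve min_z ||phi(A) S z - b||_2 using the n x m matrix phi(A) S.
    q, m = 3, 512
    A = np.random.randn(500, 20) / np.sqrt(20)
    b = np.random.randn(500)
    PhiA_S = tensor_sketch(A, q, m)                # never forms phi(A) explicitly
    z = np.linalg.lstsq(PhiA_S, b, rcond=None)[0]  # m-dimensional solution
    residual_norm = np.linalg.norm(PhiA_S @ z - b)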

In one embodiment, process 500 may further include computing an implicit low rank decomposition of φ(A), and performing PCR with respect to an approximation of the top k left singular vectors of φ(A), where k is a positive integer.

As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method or computer program product. Accordingly, aspects of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module” or “system.” Furthermore, aspects of the present invention may take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable program code embodied thereon.

Any combination of one or more computer readable medium(s) may be utilized. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.

A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.

Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.

Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).

Aspects of the present invention are described below with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.

The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.

The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.

References in the claims to an element in the singular are not intended to mean “one and only” unless explicitly so stated, but rather “one or more.” All structural and functional equivalents to the elements of the above-described exemplary embodiment that are currently known or later come to be known to those of ordinary skill in the art are intended to be encompassed by the present claims. No claim element herein is to be construed under the provisions of 35 U.S.C. section 112, sixth paragraph, unless the element is expressly recited using the phrase “means for” or “step for.”

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

The corresponding structures, materials, acts, and equivalents of all means or step plus function elements in the claims below are intended to include any structure, material, or act for performing the function in combination with other claimed elements as specifically claimed. The description of the present invention has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the invention. The embodiment was chosen and described in order to best explain the principles of the invention and the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.

What is claimed is:
1. A method comprising: obtaining one or more sets of input data; performing, by a processor device, a sketching process including: generating a matrix A and a vector b using the one or more sets of input data; creating, by the processor device, a randomized sketching matrix S based on the matrix A; and determining, by the processor device, a vector x that minimizes a normalized measure function based on the matrix A and the vector b; and performing, by the processor device, a relationship process using the sketching process for determining a relationship between the one or more sets of input data based on the vector x.
2. The method of claim 1, wherein determining the vector x comprises determining the min_(x)∥Ax−b∥_(G), where A ∈ ℝ^(n×d) and b ∈ ℝ^(n), and ∥y∥_(G) for y ∈ ℝ^(n) is specified by the measure function G: ℝ→ℝ⁺, and ∥y∥_(G)≡Σ_(i) G(y_(i)), where i, n and d are positive integers, and the matrix A comprises an n×d matrix.
3. The method of claim 2, further comprising: estimating, by the processor device, ∥Ax∥_(G), for all x ∈ ℝ^(d) based on the sketching matrix S.
4. The method of claim 3, wherein the relationship process further comprises using the sketching matrix S for a plurality of M-estimators requiring O(nnz(A)+poly(d log n)) time, the processor device performs one pass over the matrix A for determining a regression solution with a residual error within a constant factor of a specified tolerance and with constant probability, and nnz comprises a number of non-zero entries of the matrix A.
5. The method of claim 1, further comprising: embedding, by the processor device, a space induced by a nonlinear polynomial kernel for the matrix A using an oblivious subspace embedding (OSE).
6. The method of claim 5, wherein embedding the space comprises applying, by the processor device, a map φ(A) to rows of A, and the map φ(A) corresponds to the nonlinear polynomial kernel.
7. The method of claim 6, wherein determining the vector x comprises determining min_(x)∥φ(A)x−b∥_(G), where A ∈ ℝ^(n×d) and b ∈ ℝ^(n), and ∥y∥_(G) for y ∈ ℝ^(n) is specified by the measure function G: ℝ→ℝ⁺, and ∥y∥_(G)≡Σ_(i) G(y_(i)), where i, n and d are positive integers, and the matrix A comprises an n×d matrix.
8. A computer program product for determining relationships between data, the computer program product comprising a computer readable storage medium having program code embodied therewith, the program code executable by a processor to: receive, by the processor, one or more sets of input data; perform, by the processor, a sketching process executable by the processor to: generate a matrix A and a vector b based on the one or more sets of input data; create, by the processor, a randomized sketching matrix S based on the matrix A; and determine, by the processor, a vector x that minimizes a normalized measure function based on the matrix A and the vector b; and perform, by the processor, a relationship process using the sketching process to determine a relationship between the one or more sets of input data based on the vector x.
9. The computer program product of claim 8, wherein determining the vector x comprises further program code executable by the processor to determine the min_(x)∥Ax−b∥_(G), where A ∈ ℝ^(n×d) and b ∈ ℝ^(n), and ∥y∥_(G) for y ∈ ℝ^(n) is specified by the measure function G: ℝ→ℝ⁺, and ∥y∥_(G)≡Σ_(i) G(y_(i)), where i, n and d are positive integers, and the matrix A comprises an n×d matrix.
10. The computer program product of claim 9, wherein the program code is further executable by the processor to estimate ∥Ax∥_(G), for all x ∈ ℝ^(d) based on the sketching matrix S.
11. The computer program product of claim 10, wherein the relationship process further comprises using the sketching matrix S for a plurality of M-estimators requiring O(nnz(A)+poly(d log n)) time, the processor makes one pass over the matrix A for determining a regression solution with a residual error within a constant factor of a specified tolerance and with constant probability, and nnz comprises a number of non-zero entries of the matrix A.
12. The computer program product of claim 8, wherein the program code is executable by the processor further to: embed, by the processor, a space induced by a nonlinear polynomial kernel for the matrix A using an oblivious subspace embedding (OSE).
13. The computer program product of claim 12, wherein the program code is executable by the processor further to: embed the space by applying a map φ(A) to rows of A, and the map φ(A) corresponds to the nonlinear polynomial kernel.
14. The computer program product of claim 8, wherein determining the vector x comprises determining min_(x)∥φ(A)x−b∥_(G), where A ∈ ℝ^(n×d) and b ∈ ℝ^(n), and ∥y∥_(G) for y ∈ ℝ^(n) is specified by the measure function G: ℝ→ℝ⁺, ∥y∥_(G)≡Σ_(i) G(y_(i)), where i, n and d are positive integers, and the matrix A comprises an n×d matrix.
15. A method comprising: receiving one or more sets of input data; performing, by a processor device, a sketching process including: generating a matrix A and a vector b using the one or more sets of input data; creating, by the processor device, a randomized sketching matrix S based on the matrix A; applying, by the processor device, a map φ(A) to rows of the matrix A using an oblivious subspace embedding (OSE), wherein the map φ corresponds to a nonlinear kernel; processing, using the processor device, the map φ(A) and the vector b based on the randomized sketching matrix S; and determining a vector x that minimizes a normalized measure function based on the map φ(A) and the vector b; and performing, by the processor device, a relationship process using the sketching process for determining a relationship between the one or more sets of input data based on the vector x.
16. The method of claim 15, wherein determining the vector x comprises determining the min_(x)∥φ(A)x−b∥_(G), where A ∈ ℝ^(n×d) and b ∈ ℝ^(n), and ∥y∥_(G) for y ∈ ℝ^(n) is specified by the measure function G: ℝ→ℝ⁺, and ∥y∥_(G)≡Σ_(i) G(y_(i)), where i, n and d are positive integers, and the matrix A comprises an n×d matrix.
17. The method of claim 16, further comprising estimating, by the processor device, ∥φ(A)x∥_(G), for all x ∈ ℝ^(d) based on the sketching matrix S.
18. The method of claim 17, wherein the map φ(A) maps each row of A to a higher dimensional space using a polynomial kernel of a given degree.
19. The method of claim 18, further comprising: computing, by the processor device, an implicit low rank decomposition of φ(A).
20. The method of claim 19, further comprising: performing, by the processor device, a principal component regression (PCR) process with respect to an approximation of top k left singular vectors of φ(A), wherein k is a positive integer.